Refactor/tornadovm planning#117

Open
orionpapadakis wants to merge 6 commits into main from refactor/tornadovm-planning


@orionpapadakis
Collaborator

This PR reorganizes TornadoVM execution planning around three variant axes:

  • model family
  • quantization
  • forward execution mode

The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of
combinations each model/quantization pair may need to support.

This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure
instead of growing separate master-plan dispatch logic.
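
As a rough illustration of why a third axis calls for a single selection point, the sketch below dispatches on all three axes at once. Every name in it (`ModelFamily`, `Quantization`, `ForwardMode`, `PlanSelectorSketch`) is hypothetical and not taken from the PR; the support rule it encodes (single-token everywhere, prefill-decode modes only for Llama) and the wording of the exception message are assumed from the Verification notes in this description.

```java
// Hypothetical sketch of three-axis plan selection; not the PR's actual classes.
enum ModelFamily { LLAMA, MISTRAL, QWEN_3 }

enum Quantization { F16, Q8_0 }

enum ForwardMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

final class PlanSelectorSketch {

    private PlanSelectorSketch() { }

    /** Returns a label for the selected plan, or throws if the combination is unsupported. */
    static String select(ModelFamily family, Quantization quant, ForwardMode mode) {
        // Assumption: single-token execution exists for every family/quantization pair.
        boolean supported = mode == ForwardMode.SINGLE_TOKEN
                // Assumption: only Llama (F16 and Q8_0) has the prefill-decode modes so far.
                || family == ModelFamily.LLAMA;
        if (!supported) {
            throw new UnsupportedOperationException(
                    mode + " not yet supported for " + family + " + " + quant);
        }
        // A real factory would build a plan object here; a label keeps the sketch small.
        return family + "/" + quant + "/" + mode;
    }
}
```

With this shape, adding a new mode means extending one enum and one rule rather than adding another master-plan dispatch path per model/quantization pair.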

Notes

  • Adds Llama Q8_0 prefill-decode support, which also demonstrates why this refactor is needed.
  • Renames task-graph abstractions for clearer roles.
  • Moves scheduling helpers into a dedicated TornadoVM scheduling package.
  • Keeps graph topology and execution behavior unchanged outside the new prefill-decode path.

Verification

  • Use Java 21 or 25

  • Set up TornadoVM

  • mvn clean install

  • llama fp16 (single-token):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

  • llama fp16 (prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

  • llama fp16 (batch-prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

  • llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

  • llama q8_0 (single-token):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048

  • llama q8_0 (prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode

  • llama q8_0 (batch-prefill-decode):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32

  • llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
    ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should still pass with the single-token configuration, but should fail for any prefill-decode configuration with the following message:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsupportedOperationException: BATCH_PREFILL_DECODE not yet supported for QWEN_3 + F16
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createQwen3FP16Plan(ForwardPlanFactory.java:174)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createFP16Plan(ForwardPlanFactory.java:90)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.create(ForwardPlanFactory.java:74)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createBatchPrefillDecode(ForwardPlanFactory.java:65)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.createExecutionPlan(TornadoVMMasterPlanBatchPrefillDecode.java:70)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.<init>(TornadoVMMasterPlanBatchPrefillDecode.java:51)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlan.initializeTornadoVMPlan(TornadoVMMasterPlan.java:59)
  at org.beehive.gpullama3.model.Model.runInstructOnce(Model.java:205)
  at org.beehive.gpullama3.LlamaApp.runSingleInstruction(LlamaApp.java:18)
  at org.beehive.gpullama3.LlamaApp.main(LlamaApp.java:44)
Error: Command failed with return code 1

Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.
…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.
Comment on lines +157 to +162
MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
int blocksPerToken = (configuration.dim() + 31) / 32;
long bytesPerToken = (long) blocksPerToken * 34;
MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
state.embeddingX.getSegment(), 0, bytesPerToken);
}
Member

Maybe this should be a method of its own. Same for the block above.
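
One possible shape for such a helper is sketched below. It assumes the constant 34 in the snippet is the Q8_0 block size in bytes (a 2-byte fp16 scale followed by 32 int8 values), and it uses plain byte arrays instead of `MemorySegment` so the example stays self-contained. The class and method names are hypothetical.

```java
// Hypothetical helper; the byte-array signature is an illustration, not the PR's API.
final class Q8EmbeddingCopySketch {

    static final int Q8_0_BLOCK_SIZE = 32;  // quantized values per block
    static final int Q8_0_BLOCK_BYTES = 34; // assumed: 2-byte fp16 scale + 32 int8 values

    private Q8EmbeddingCopySketch() { }

    /** Bytes occupied by one token's embedding row when stored as Q8_0 blocks. */
    static long bytesPerToken(int dim) {
        int blocksPerToken = (dim + Q8_0_BLOCK_SIZE - 1) / Q8_0_BLOCK_SIZE; // ceiling division
        return (long) blocksPerToken * Q8_0_BLOCK_BYTES;
    }

    /** Copies the quantized row for {@code token} from {@code table} to the start of {@code dest}. */
    static void copyTokenRow(byte[] table, int token, int dim, byte[] dest) {
        int rowBytes = Math.toIntExact(bytesPerToken(dim));
        int offset = Math.toIntExact((long) token * rowBytes); // fails loudly on overflow
        System.arraycopy(table, offset, dest, 0, rowBytes);
    }
}
```

Naming the two constants also removes the bare 32 and 34 from the call site.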

}

// ── Q8_0 Batch Kernels ───────────────────────────────────────────────────

Member

The formatting is odd. Wrap the block in @formatter:off / @formatter:on markers and run the autoformatter.
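
For reference, a minimal sketch of the marker comments, assuming the IDE or formatter in use is configured to honor them (IntelliJ IDEA and Eclipse both support this). The class and its contents are placeholders, not code from the PR.

```java
// Placeholder class showing formatter markers around a hand-aligned block.
final class FormatterMarkersSketch {

    private FormatterMarkersSketch() { }

    // @formatter:off
    static final int[][] LAYOUT = {
        {  1,  2,  3 },
        { 10, 20, 30 },
    };
    // @formatter:on
}
```

Everything between the two markers is left untouched by an autoformatting pass, while the rest of the file is formatted normally.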

}

@Override
protected String predecessorGraphName(int layerIndex) {
Member

Formatter again: use the annotations, otherwise the first autoformatting pass will flatten it.


// ── Embedding preparation ─────────────────────────────────────────────────

@Override public EmbeddingPreparer embeddingPreparer() {
Member

Add Javadoc, as this is new functionality and no one else knows what it does.

}

@Override public ActivationTaskGraph standardActivation() {
return new Activation("activationUpdate", state, weights, config);
Member

Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that is reused, instead of having these Strings all over the place.
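
A minimal sketch of the enum variant of this suggestion follows. The enum name and its `id()` accessor are hypothetical; only the two string values come from the comment above.

```java
// Hypothetical enum centralizing task-graph name strings.
enum TaskGraphNameSketch {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphNameSketch(String id) {
        this.id = id;
    }

    /** The string identifier passed to the task-graph constructor. */
    String id() {
        return id;
    }
}
```

A call site could then read `new Activation(TaskGraphNameSketch.ACTIVATION_UPDATE.id(), state, weights, config)`, so a typo in a graph name becomes a compile error rather than a runtime mismatch.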

Member

@mikepapadim left a comment

LGTM, some minor changes needed.
